1 Executive summary
Part 1 of this report aims to answer the research question: how does the distribution of coral bleaching vary across latitude zones during the 2014–2017 global bleaching event? To achieve this, an interactive scatter map visualization is created, which finds that across all years of the event, the highest percentage of bleaching occurred in the near-equatorial and equatorial zones. Considering individual years, this pattern holds for 2014 and 2016. In 2015, however, bleaching was highest in tropical midlatitudes, and in 2017, bleaching peaked in tropical lowlatitudes.
Part 2 of this report aims to answer the research question: how does the accuracy of bleaching predictions from current variables compare with that from variables measured four years prior? To achieve this, multiple machine learning (ML) models were trained and deployed. Models built on the contemporary dataset had a slightly higher average prediction accuracy (45%) than those built on the dataset from four years prior (43%).
2 Part 1
2.1 Introduction
This section investigates a statement made by the authors of a study (S. Sully 2019) that examines coral bleaching events from 1998 to 2017 across 81 countries. They claim that “the highest probability of coral bleaching occurred at tropical midlatitude sites (15–20 degrees north and south of the Equator)”.
2.2 Formulating problem
From this claim, we derive the following research question and hypothesis:
Research question: How does the distribution of coral bleaching vary across latitude zones during the 2014–2017 global bleaching event?
Hypothesis: The tropical midlatitude zone (15–20°) will show the highest average bleaching levels compared to other latitude zones between 2014 and 2017.
The 2014–2017 period is of special interest, as it represents one of the most intense coral bleaching events on record.
2.3 Exploratory Data Analysis
2.3.1 Data cleaning
The various environmental measurements recorded throughout the study, such as temperature and average bleaching, are contained in an associated dataset. The data stores location as latitude and longitude, which can be transformed into the following six latitude zones (Meteoblue n.d.): Equatorial (0-5°), Near-Equatorial (5-10°), Tropical Lowlatitude (10-15°), Tropical Midlatitude (15-20°), Tropical Highlatitude (20-25°), and Subtropical (25-35°).
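As an illustrative sketch (the latitude column is assumed to be named `Latitude.Degrees`, as in the dataset), this binning can be expressed with `dplyr::case_when` applied to the absolute latitude:

```r
library(dplyr)

# Bin absolute latitude into the six zones defined above.
# Latitudes at or beyond 35° fall outside the zones and return NA.
assign_latitude_zone <- function(lat) {
  dplyr::case_when(
    abs(lat) <  5 ~ "Equatorial (0-5°)",
    abs(lat) < 10 ~ "Near-Equatorial (5-10°)",
    abs(lat) < 15 ~ "Tropical Lowlatitude (10-15°)",
    abs(lat) < 20 ~ "Tropical Midlatitude (15-20°)",
    abs(lat) < 25 ~ "Tropical Highlatitude (20-25°)",
    abs(lat) < 35 ~ "Subtropical (25-35°)"
  )
}

assign_latitude_zone(c(-3, 17, 28))
```

Because `case_when` evaluates conditions in order, each branch only needs an upper bound; taking the absolute value folds the northern and southern hemispheres into the same zones.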
Missing values, denoted by ‘NA’, are also removed to ensure subsequent operations run on a complete and consistent dataset; such values could represent sensor failure or missed readings.
2.3.2 Data distribution
Before assessing the paper’s claim about bleaching, the distribution of bleaching in the dataset must be investigated. Below is a boxplot visualizing the distribution of Average_bleaching, a measure of the bleaching percentage of a particular reef, in each of the latitude zones defined in the previous section. The y axis, ‘Log Average Bleaching’, is a log10 transformation of Average_bleaching, which compresses the variable’s range from 0–100 down to roughly 0–2 and neatens the plot. The graph shows that average bleaching is relatively evenly distributed across the zones.
Code
ggplot(p1data, aes(x = Latitude_Zone, y = Average_bleaching)) +
  geom_boxplot(outlier.shape = NA, fill = "white", color = "black") +
  scale_y_log10() +
  labs(title = "Distribution of Average Bleaching by Latitude Zone",
       x = "Latitude Zone", y = "Log Average Bleaching (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        plot.title = element_text(face = "bold"))
2.4 Executing solution
2.4.1 Initial visualization
The following is a barplot of the average bleaching in each Latitude Zone from 2014 to 2017. Bleaching percentage generally decreased from 2014 to 2017, with the Near-Equatorial zone having the highest bleaching among all zones in 2016 and 2017. In addition, the Tropical Lowlatitude and Midlatitude zones peaked in 2014 and 2015, respectively.
Code
ggplot(p1data, aes(x = Year, y = Average_bleaching, fill = Latitude_Zone)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Year", y = "Average Bleaching (%)", fill = "Latitude Zone",
       title = "Average Bleaching by Latitude Zone (2014–2017)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))
2.4.2 Interactive visualization
While a barplot is useful, a scatter map visualization lets us detect spatial patterns of bleaching among the latitude zones more intuitively. Below is an interactive geoscatter visualization, adapted from example Plotly code (plotly Graphing Libraries 2025). We load the map data from plotly, which provides the canvas on which to ‘draw’ bleaching points. The points are formatted and given a color scale, and filters for year, zone, and average bleaching are added above the plot.
At the plot’s default settings, it is evident that the highest percentage of bleaching occurred in the near-equatorial and equatorial zones. Adjusting the ‘year’ filter shows that this holds for 2014 and 2016. In 2015, however, bleaching was highest in tropical midlatitudes, concentrated around the east coast of Australia. In 2017, bleaching peaked in tropical lowlatitudes.
2.5 Evaluation and improvements
2.5.1 Comparing against hypothesis
Insights from the visualizations do not fully support the hypothesis. The tropical midlatitude zone had the highest average bleaching in only one of the four years, 2015, with equatorial zones peaking higher in 2014 and 2016 and tropical lowlatitudes dominating in 2017. This variability suggests that the claim that midlatitudes experienced the highest probability of bleaching does not hold uniformly during the 2014–2017 bleaching event. Several factors that could explain this inconsistency are explored below.
2.5.2 Statistical significance
There are clearly differences in bleaching across the latitude zones, but are these differences statistically significant? That is, are they larger than would be expected by chance alone? To ascertain this, we could perform an ANOVA test, comparing mean bleaching across the zones to determine whether the differences are significant.
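As a sketch, such a test could look like the following in R, assuming the cleaned dataframe `p1data` from the earlier sections:

```r
# One-way ANOVA of average bleaching across latitude zones,
# followed by Tukey's HSD to see which pairs of zones differ.
fit <- aov(Average_bleaching ~ Latitude_Zone, data = p1data)
summary(fit)    # overall F statistic and p-value
TukeyHSD(fit)   # pairwise zone comparisons with adjusted p-values
```

Note that ANOVA assumes approximately normal residuals with equal variances across groups; if those assumptions fail, a non-parametric alternative such as the Kruskal-Wallis test may be more appropriate.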
2.5.3 Temporal correlation
In addition, treating and analysing each year of bleaching as if it were independent of the others may be incorrect. To investigate this, a Durbin-Watson test (Kenton 2024) was performed on a linear model of average bleaching against year. The Durbin-Watson (DW) statistic obtained was 0.003, indicating strong positive autocorrelation: bleaching values in one year are highly correlated with values in adjacent years. Treating each year’s bleaching data as independent observations may therefore lead to incorrect conclusions.
Code
dwtest(lm(Average_bleaching ~ Year, data = p1data))
Durbin-Watson test
data: lm(Average_bleaching ~ Year, data = p1data)
DW = 0.0033551, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
2.5.4 Sample distribution
In addition, not all latitude zones were equally sampled in the dataset. The tropical lowlatitude, highlatitude, and subtropical zones were particularly under-sampled compared to others, possibly due to the lower prevalence of reefs in those regions. This uneven distribution of observations could act as a confounding factor, which limits the accuracy of comparisons across zones. That is, if one zone has significantly more measurements than another, we cannot be confident that observed differences in bleaching are wholly due to the zone itself, instead of differences in sample size.
Code
plot1 <- ggplot(p1data, aes(x = Latitude_Zone)) +
  geom_bar(fill = "darkgray") +
  labs(title = "Observations per Latitude Zone", x = "Latitude Zone", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

plot2 <- ggplot(p1data, aes(x = Year)) +
  geom_bar(fill = "darkgreen") +
  labs(title = "Observations per year", x = "Year", y = "Count") +
  theme_minimal()

plot1 + plot2 +
  plot_layout(guides = 'collect', ncol = 2, nrow = 1) +
  plot_annotation(tag_levels = 'A') &
  theme(legend.position = 'top')
3 Part 2
3.1 Introduction
This section investigates which set of environmental measurements (salt, pH, etc.) predicts coral bleaching more accurately: present values, or those measured four years prior.
3.2 Formulating problem
Research question: How does the accuracy of bleaching predictions from current variables compare with that from variables measured four years prior?
Hypothesis: The accuracy of bleaching predictions from current variables will be higher than that from variables measured four years prior.
3.3 Exploratory Data Analysis
3.3.1 Extracting NCDF data
The explanatory variables of interest (temperature, salt, pH, nitrogen and algae) are contained within the eReefs dataset, which is distinct from the bleaching dataset used in part 1. The data is stored as a netCDF (nCDF) file, a common format for multidimensional scientific data (DATA3888 2025). We can extract the variables from the nCDF file (DATA3888 2025) and store them in a standard R dataframe for ease of use.
Code
eReefs_nc = ncdf4::nc_open("data/eReefs_assessment.nc")

# Longitude and latitude
lat  = ncdf4::ncvar_get(eReefs_nc, "latitude")
long = ncdf4::ncvar_get(eReefs_nc, "longitude")

# Time
time   = ncdf4::ncvar_get(eReefs_nc, "time")
tunits = ncdf4::ncatt_get(eReefs_nc, "time", "units")
cf = CFtime::CFtime(tunits$value, calendar = "standard", time)  # convert time to CFtime class
timestamps = CFtime::as_timestamp(cf)                 # get character-string times
timestamps = as.Date(timestamps, format = "%Y-%m-%d") # convert strings to Date objects instead of the epoch format

# Explanatory variables
temp     = ncdf4::ncvar_get(eReefs_nc, "temp")
salt     = ncdf4::ncvar_get(eReefs_nc, "salt")
nitrogen = ncdf4::ncvar_get(eReefs_nc, "TOTAL_NITROGEN")
algae    = ncdf4::ncvar_get(eReefs_nc, "MA_N")
ph       = ncdf4::ncvar_get(eReefs_nc, "PH")

# Convert data to a data.frame
eReefs_df = expand.grid(long = long, lat = lat, time = timestamps) %>%
  mutate(temp = as.vector(temp), salt = as.vector(salt),
         nitrogen = as.vector(nitrogen), algae = as.vector(algae),
         ph = as.vector(ph))
eReefs_noNan = eReefs_df %>% drop_na()
3.3.2 Joining the dataframes
With the variables extracted, the eReefs data is now stored in the same format as the bleaching data from part 1. For simplicity, we can join the eReefs and bleaching datasets into a single dataframe that contains both the explanatory variables and their matching bleaching data. This way, a single dataframe can be used to predict bleaching from the explanatory variables.
To achieve this, rows in the eReefs and bleaching datasets from the same year are matched (DATA3888 2025), so long as the reefs the two rows correspond to lie within a distance of 1000 units. For the variables from four years earlier, the process is repeated, matching rows where the variables were measured four years before the bleaching data.
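A sketch of this spatial-temporal join using the sf package is shown below. Column names in the bleaching data (`longitude`, `latitude`, `surveyDate`) are assumptions for illustration; `eReefs_noNan` is the cleaned eReefs dataframe from the previous section.

```r
library(sf)
library(dplyr)
library(lubridate)

# Convert both dataframes to sf point geometries and record the year of
# each observation.
bleach_sf <- st_as_sf(bleachingData, coords = c("longitude", "latitude"), crs = 4326) %>%
  mutate(year_bleach = year(surveyDate))
eReefs_sf <- st_as_sf(eReefs_noNan, coords = c("long", "lat"), crs = 4326) %>%
  mutate(year_eReef = year(time))

# Pair bleaching rows with eReefs rows that lie within dist of each other;
# for geographic (EPSG:4326) coordinates, sf interprets dist in metres.
joined_sf <- st_join(bleach_sf, eReefs_sf,
                     join = st_is_within_distance, dist = 1000, left = TRUE)

joined_df <- joined_sf %>% as.data.frame() %>% drop_na() %>%
  filter(year_eReef == year_bleach)               # contemporary variables
joined_df_4yrago <- joined_sf %>% as.data.frame() %>%
  filter(year_bleach - year_eReef == 4)           # variables from four years prior
```

Passing `st_is_within_distance` as the join predicate keeps every bleaching point paired with all eReefs cells inside the radius; the subsequent year filters then split the result into the contemporary and lagged datasets.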
3.4 Executing solution
3.4.1 Principal Component Analysis
Using explanatory variables (salt, pH, etc.) to predict a category (bleaching) can be achieved with machine learning (ML) models. These models are trained to identify patterns between the variables and the class, such that given the variable values for a particular row/instance, the model can predict which class the instance belongs to. Specifically, given values for salt, pH, algae, temperature and nitrogen, the model predicts which category of bleaching a reef with these measurements would experience.
One way to decrease model complexity and improve performance is principal component analysis (PCA). PCA constructs linear combinations of the variables (principal components), ordered by how much of the variation in the dataset each one explains. Note that PCA is unsupervised: the components capture variance in the explanatory variables, which does not guarantee they are the best predictors of the class. PCA can nonetheless reduce the number of variables used in our models, lowering computational cost. For example, if a dataset has 20 variables and the first 3 principal components explain 90% of the total variance, we may be able to train a model on just those 3 components instead of all 20, greatly improving efficiency.
Performing PCA on the environmental variables, we find that the first three principal components account for 94% of the variance in the contemporary dataset and 90% of the variance in the dataset from four years prior. We therefore select these three principal components to train the models.
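A minimal sketch of this step, assuming the joined dataframe `joined_df` from the previous section with the five environmental columns:

```r
# PCA on the five environmental variables; scale. = TRUE standardises
# them first, since they are measured in different units.
pca_model <- prcomp(joined_df[, c("temp", "salt", "nitrogen", "algae", "ph")],
                    scale. = TRUE)
summary(pca_model)                             # proportion of variance per component

# Keep the scores on the first three components as model inputs.
pca_res <- as.data.frame(pca_model$x[, 1:3])
colnames(pca_res) <- c("PC1", "PC2", "PC3")
pca_res$bleachCat <- joined_df$bleachCat       # attach the class label
```

The same procedure would be applied separately to the four-years-prior dataframe to obtain its own component scores.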
3.4.2 Model deployment
We deploy three ML models: k-nearest neighbor (kNN), a support vector machine (SVM), and a random forest (RF). kNN determines the class of a data point from the majority class of its k nearest neighbors. Here, “nearness” is quantified by a distance measure: data points with similar explanatory variables have a low distance between them.
SVM classifies data by finding the optimal decision boundary (a line, or more generally a hyperplane) that maximizes the margin between the classes. Its goal is to separate the classes as cleanly as possible, making each one easier to identify. Decision trees (DTs) build a hierarchical structure of decisions that classifies an example based on its variables. Random forests combine an ensemble of many DTs to improve prediction accuracy.
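As an illustrative sketch (not the exact tuning used in this report), the three models can be fitted on the PCA scores as follows, assuming a dataframe `pca_res` holding PC1-PC3 and the bleaching category `bleachCat`:

```r
library(class)          # knn()
library(e1071)          # svm()
library(randomForest)   # randomForest()

X <- pca_res[, c("PC1", "PC2", "PC3")]
y <- as.factor(pca_res$bleachCat)

# kNN has no separate training step: it classifies test points directly
# from the training data (here k = 5 neighbors, an illustrative choice).
knn_pred <- knn(train = X, test = X, cl = y, k = 5)

svm_fit <- svm(x = X, y = y)           # radial kernel by default
rf_fit  <- randomForest(x = X, y = y)  # 500 trees by default
```

Predictions from the fitted SVM and RF are obtained with `predict(svm_fit, newdata)` and `predict(rf_fit, newdata)`.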
To obtain a reliable estimate of accuracy, models are trained and evaluated using 5-fold cross-validation (CV). 5-fold CV randomly splits the dataset into 5 subsets (folds); the model is trained on 4 folds and evaluated on the remaining fold. This process is repeated 5 times, using a different fold for evaluation each time.
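A sketch of one round of 5-fold CV for a single model (kNN shown; the same loop applies to SVM and RF), again assuming the `pca_res` dataframe described above:

```r
library(class)     # knn()
library(cvTools)   # cvFolds()

set.seed(3888)
X <- pca_res[, c("PC1", "PC2", "PC3")]
y <- pca_res$bleachCat

folds <- cvFolds(nrow(X), K = 5)   # random assignment of rows to 5 folds
acc <- numeric(5)
for (j in 1:5) {
  test_id <- folds$subsets[folds$which == j]   # rows held out in fold j
  pred <- knn(train = X[-test_id, ], test = X[test_id, ],
              cl = y[-test_id], k = 5)
  acc[j] <- mean(pred == y[test_id])           # accuracy on the held-out fold
}
mean(acc)   # cross-validated accuracy estimate
```

Repeating this whole procedure over many random fold assignments and averaging, as done in this report, further stabilises the estimate.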
3.5 Evaluation and improvements
3.5.1 Comparing against hypothesis
Upon training and evaluating the models using CV, the following accuracy results were obtained. Model accuracy is defined as the percentage of correct class predictions, i.e. the proportion of predictions where a reef’s predicted bleaching category matches its actual measured category.
Across both datasets, current and four years prior, RF was the most accurate model, followed by kNN and then SVM. Models built on the contemporary dataset had a slightly higher average prediction accuracy (45%) than those built on the dataset from four years prior (43%). These findings are consistent with the hypothesis.
3.5.2 Generally low accuracy
Accuracy across all models in both datasets was low (around 44%). The bleaching event investigated (2015-17) was unprecedented and highly unpredictable, which may explain the low prediction accuracy. In addition, the spatial distance used to join the eReefs and bleaching datasets may have been too large, potentially matching environmental data that does not accurately reflect conditions at a specific bleaching site. Reducing this distance could improve accuracy.
3.5.3 Contemporary vs prior data
As found above, models built on the contemporary dataset had a higher prediction accuracy than those built on the dataset from four years prior. This may be because coral bleaches in response to present environmental conditions, not those of four years earlier.
3.5.4 Accuracy variation
The three models vary in accuracy, which may reflect the nature of each model. RF was consistently the most accurate, perhaps because it is an ensemble of individual decision trees, each a model in its own right, whereas kNN and SVM are single models.
3.5.5 Accuracy as an incomplete metric
Furthermore, accuracy may not be the best metric for evaluating model performance. The predicted class, the bleaching category (bleachCat), is an ordinal variable ranging from 0 to 5. If a reef’s bleaching category is 5, predicting 1 is worse than predicting 3, since 3 is closer to 5. Accuracy alone, a binary comparison of whether a prediction equals the actual value, cannot capture this nuance, so it may give an incomplete view of a model’s true performance. Models and/or evaluation metrics designed for ordinal data may therefore be a useful avenue to explore.
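One simple example of such a metric: treat the ordinal categories as scores and report the mean absolute category error alongside accuracy, so that near-misses are penalised less than distant ones. A sketch:

```r
# Mean absolute error between predicted and actual ordinal categories.
# Converting factor -> character -> numeric recovers the category values
# rather than the internal factor level codes.
ordinal_mae <- function(pred, actual) {
  mean(abs(as.numeric(as.character(pred)) - as.numeric(as.character(actual))))
}

# If the true categories are all 5, predictions of 1, 3 and 5 give
# absolute errors 4, 2 and 0, for a mean of 2.
ordinal_mae(factor(c(1, 3, 5)), factor(c(5, 5, 5)))
```

A lower value indicates predictions that land closer to the true category on average, even when they are not exactly correct.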
---title: "Discipline Assignment 1"date: "today"author: "520514492"bibliography: [refs/refs.bib]format: html: embed-resources: true code-fold: true code-tools: true table-of-contents: true number-sections: true ---```{r setup, include = FALSE}library(tidyverse)library(patchwork) library(plotly)library(maps)library(crosstalk)library(htmltools)library(sf) library(lmtest) library(class)library(e1071) library(randomForest) library(cvTools) ```# Executive summaryPart 1 of this report aims to answer the research question: how does the distribution of coral bleaching vary across latitude zones during the 2014–2017 global bleaching event? To achieve this, an interactive scatter map visualization is created, which finds that across all years of the event, the highest percentage of bleaching occurred in the near-equatorial and equatorial zones. In particular years, this fact remains true for 2014 and 2016. In 2015, however, bleaching was highest in tropical midlatitudes, and in 2017, bleaching peaked in tropical lowlatitudes.Part 2 of this report aims to answer the research question: how do the accuracy of bleaching predictions from current vs four year old variables compare? To achieve this, multiple machine learning (ML) models were trained and deployed. The findings were that models built on the contemporary dataset, had a slightly higher average prediction accuracy (45%) than the four years prior dataset (43%).This report was formatted in `Quatro`[@Quatro]. `R`[@R] is used most of the report's basic numerical analysis. 
The following packages were also used: the `Tidyverse` collection [@tidyverse], for visualization; `Patchwork`[@patchwork], to format graphs; `Plotly`[@plotly], `Maps`[@maps], `Crosstalk`[@crosstalk] and `HTMLtools`[@htmltools], to create interactive graphs; `Simple Features (sf)`[@sf], to transform NCDF data; `lmtest`[@lmtest] to perform the Durbin-Watson test; and `class`[@class], `e1071`[@e1071], `randomForest`[@randomForest] and `cvTools`[@cvTools], to train and deploy ML models.# Part 1## IntroductionThis section investigates a statement made by the authors of a study [@VarDesc] that examines coral bleaching events from 1998 to 2017 across in 81 countries. They claim that “the highest probability of coral bleaching occurred at tropical midlatitude sites (15–20 degrees north and south of the Equator)”.## Formulating problemFrom this claim, we derive the following research question and hypothesis:**Research question**: *How does the distribution of coral bleaching vary across latitude zones during the 2014–2017 global bleaching event?* **Hypothesis**: *The tropical midlatitude zone (15–20°) will show the highest average bleaching levels compared to other latitude zones between 2014 and 2017.*The 2014-2017 period is of special interest, as it represents on of the most intense coral bleaching events on record.## Exploratory Data Analysis### Data cleaningThe various environmental measurements recorded throughout the study - such as temperature, and average bleaching - are contained in an associated dataset. The data stores location as latitude and longitude, which can be transformed into the following five equatorial zones [@latZones]: Equatorial (0-5°), Near-Equatorial (5-10°), Tropical Lowlatitude (10-15°), Tropical Midlatitude (15-20°), Tropical Highlatitude (20-25°), and Subtropical (25-35°).Missing values, denoted by 'NA', are also removed, to ensure subsequent operations occur on a complete and consistent dataset. 
These values could represent sensor failure or missed data.```{r}p1 <-read.csv("data/Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv")p1data <- p1 %>%drop_na() %>%filter(Year >=2014& Year <=2017) %>%arrange(Average_bleaching) %>%mutate(Latitude_Zone =case_when(abs(Latitude.Degrees) >=25&abs(Latitude.Degrees) <35~"Subtropical (25-35°)",abs(Latitude.Degrees) >=20&abs(Latitude.Degrees) <25~"Tropical Highlatitude (20-25°)",abs(Latitude.Degrees) >=15&abs(Latitude.Degrees) <20~"Tropical Midlatitude (15-20°)",abs(Latitude.Degrees) >=10&abs(Latitude.Degrees) <15~"Tropical Lowlatitude (10-15°)",abs(Latitude.Degrees) >=5&abs(Latitude.Degrees) <10~"Near-Equatorial (5-10°)",abs(Latitude.Degrees) >=0&abs(Latitude.Degrees) <5~"Equatorial (0-5°)" ) )p1data$Latitude_Zone <-factor(p1data$Latitude_Zone, levels =c("Equatorial (0-5°)","Near-Equatorial (5-10°)","Tropical Lowlatitude (10-15°)","Tropical Midlatitude (15-20°)","Tropical Highlatitude (20-25°)","Subtropical (25-35°)"))p1data %>%select(Reef.Name, Country, Year, Average_bleaching, Temperature_Mean, Latitude.Degrees, Latitude_Zone) %>%head(6) %>% knitr::kable(caption ="Sample of Bleaching Dataset")bleachingData = p1data```### Data distributionBefore assessing the paper's claim about bleaching, the distribution of bleaching in the dataset must be investigated. Below is a boxplot visualizing the distribution of Average bleaching in each of the latitude zones that were defined in the earlier section. `Average_bleaching` is a measure of the bleaching percentage of a particular reef. The y axis, 'Log Average Bleaching', is a log transformation of `Average_bleaching`. This reduces variable's range from 100 to 2, neatening the plot. 
It is evident from the graph, that the average bleaching is relatively evenly distributed across the zones.```{r, warning=FALSE}ggplot(p1data, aes(x = Latitude_Zone, y = Average_bleaching)) + geom_boxplot(outlier.shape = NA, fill = "white", color = "black") + scale_y_log10() + labs(title = "Distribution of Average Bleaching by Latitude Zone", x = "Latitude Zone", y = "Log Average Bleaching (%)") + theme_minimal() + theme(axis.text.x = element_text(angle = 30, hjust = 1), plot.title = element_text(face = "bold") )```## Executing solution### Initial visualizationThe following is a barplot of the average bleaching in each Latitude Zone from 2014 to 2017. The general trend is that bleaching percentage generally decreased from 2014-17, with the Near-Equatorial zone having the highest bleaching among all zones in 2016 and 2017. In addition, Tropical Lowlatitiude and Midlatitudes peaked in 2014 and 2015, respectively.```{r}ggplot(p1data, aes(x = Year, y = Average_bleaching, fill = Latitude_Zone)) +geom_col(position ="dodge") +scale_fill_brewer(palette ="Set2") +labs(x ="Year", y ="Average Bleaching (%)",fill ="Latitude Zone", title ="Average Bleaching by Latitude Zone (2014–2017)") +theme_minimal() +theme(plot.title =element_text(face ="bold") )```### Interactive visualizationWhile a barplot is useful, in order to more intuitively detect patterns of bleaching among the latitude zones, we can use a scatter map visualization.Below is an interactive geoscatter visualization, adapted from example `Plotly` code [@geoExample]. We load the map data from `plotly`, which gives us the canvas to 'draw' bleaching points on. The points are formatted, given a color scale, and then filters for year, zone, and average bleaching are added to the top of the plot.From the plot as default, it is evident that the highest percentage of bleaching occurred in the near-equatorial and equatorial zones. Adjusting the 'year' filter, it is clear that this fact remains true for years 2014 and 2016. 
In 2015, however, bleaching was highest in tropical midlatitudes, concentrated around the east coast of australia. In 2017, bleaching peaked in tropical lowlatitudes.```{r}shared_data <- SharedData$new(p1data)# Plotly map map <-plot_ly(data = shared_data, type ="scattergeo", mode ="markers",lat =~Latitude.Degrees, lon =~Longitude.Degrees,marker =list(size =6, opacity =0.8, color =~Average_bleaching,colorscale ="Magma", colorbar =list(title ="Average Bleaching (%)")),text =~paste("Year:", Year, "Zone:", Latitude_Zone, "Bleaching:", Average_bleaching, "%"), hoverinfo ="text") %>% plotly::layout(title =list(text ="Map of Coral Bleaching"),geo =list(showland =TRUE, landcolor ="#416f22", showcountries =TRUE,showocean =TRUE, oceancolor ="#1a2845", projection =list(type ="equirectangular") ) )# filtersfilters <-bscols(widths =c(2, 3, 4),filter_select("year_filter", "Year", shared_data, ~Year),filter_select("zone_filter", "Latitude Zone", shared_data, ~Latitude_Zone),filter_slider("bleach_filter", "Average Bleaching (%)", shared_data, ~Average_bleaching, width ="100%"))# Renderbrowsable(tagList(filters, map))```## Evaluation and improvements### Comparing against hypothesisInsights from the visualizations, do not fully support the hypothesis. The tropical midlatitude had the highest average bleaching only once out of the four years, in 2015, with equatorial zones having higher peaks in 2014 and 2016, and tropical lowlatitudes dominating in 2017. This variability suggests that the claim made — that midlatitudes experienced the highest probability of bleaching — does not hold uniformly during the 2014–2017 bleaching event. Several factors could explain this inconsistency, which are explored below.### Statistical significanceThere are clearly differences in bleaching between across the latitude zones, but are these differences *statistically significant*? I.e: did they occur beyond pure chance? To ascertain this, we could perform an ANOVA test. 
This would compare the average bleaching across each zone to determine whether the differences are significant.### Temporal correlationIn addition, treating and analysing each year of bleaching as if they are independent from the others may be incorrect. To investigate this, a Durbin Watson test [@durbWtsn] was performed on a linear model of average bleaching against year. The Durbin-Watson (DW) statistic obtained was 0.003, indicating a strong positive autocorrelation. This suggests bleaching values in one year are highly correlated with values in adjacent years. Therefore, treating each year's bleaching data as independent observations may lead to incorrect conclusions.```{r}dwtest(lm(Average_bleaching ~ Year, data = p1data))```### Sample distributionIn addition, not all latitude zones were equally sampled in the dataset. The tropical lowlatitude, highlatitude, and subtropical zones were particularly under-sampled compared to others, possibly due to the lower prevalence of reefs in those regions. This uneven distribution of observations could act as a confounding factor, which limits the accuracy of comparisons across zones. That is, if one zone has significantly more measurements than another, we cannot be confident that observed differences in bleaching are wholly due to the zone itself, instead of differences in sample size. ```{r}plot1 <-ggplot(p1data, aes(x = Latitude_Zone)) +geom_bar(fill ="darkgray") +labs(title ="Observations per Latitude Zone",x ="Latitude Zone", y ="Count") +theme_minimal() +theme(axis.text.x =element_text(angle =30, hjust =1))plot2 <-ggplot(p1data, aes(x = Year)) +geom_bar(fill ="darkgreen") +labs(title ="Observations per year",x ="Year", y ="Count") +theme_minimal() plot1 + plot2 +plot_layout(guides ='collect', ncol =2, nrow =1) +plot_annotation(tag_levels ='A') &theme(legend.position='top')```# Part 2## IntroductionThis section aims to investigate which environmental variables (salt, ph, etc.) 
are more accurate at predicting coral bleaching: present variables, or those measured four years prior.## Formulating problem**Research question**: *How do the accuracy of bleaching predictions from current vs four year old variables compare?* **Hypothesis**: *The accuracy of bleaching predictions from current variables will be higher than four year old variables*## Exploratory Data Analysis### Extracting NCDF dataThe explanatory variables of interest (temperature, salt, pH, nitrogen and algae), are contained within the `eReefs` dataset. This is a distinct and separate dataset to the bleaching dataset from part 1. This data is stored as a netCDF (nCDF) file, a common format for multidimensional scientific data and variables [@Lab1]. We can extract the variables from the nCDF [@Lab1], and store it as the standard R dataframe, for ease of use.```{r}eReefs_nc = ncdf4::nc_open("data/eReefs_assessment.nc")# Longitude and Latitudelat = ncdf4::ncvar_get(eReefs_nc, "latitude")long = ncdf4::ncvar_get(eReefs_nc, "longitude")# Timetime = ncdf4::ncvar_get(eReefs_nc, "time")tunits = ncdf4::ncatt_get(eReefs_nc, "time", "units")cf = CFtime::CFtime(tunits$value, calendar ="standard", time) # convert time to CFtime classtimestamps = CFtime::as_timestamp(cf) # get character-string timestimestamps =as.Date(timestamps, format ="%Y-%m-%d") # convert string to date object instead of the epoch format# Explanatory variablestemp = ncdf4::ncvar_get(eReefs_nc, "temp")salt = ncdf4::ncvar_get(eReefs_nc, "salt")nitrogen = ncdf4::ncvar_get(eReefs_nc, "TOTAL_NITROGEN")algae = ncdf4::ncvar_get(eReefs_nc, "MA_N")ph = ncdf4::ncvar_get(eReefs_nc, "PH")# Convert data to data.frameeReefs_df =expand.grid(long = long, lat = lat, time = timestamps) %>%mutate(temp =as.vector(temp), salt =as.vector(salt), nitrogen =as.vector(nitrogen),algae =as.vector(algae), ph =as.vector(ph))eReefs_noNan = eReefs_df %>%drop_na()```### Joining the dataframesNow that the variables are extracted into a dataframe, the eReefs 
data is now stored in the same format as the bleaching data from part 1. For ease of manipulation and simplicity, we can join the eReefs and bleaching datasets, into a single joined dataframe, that contains both explanatory variables, and their matching bleaching data. This way, we can use a single dataframe to predict bleaching data from explanatory variables.To achieve this, rows in eReefs and the bleaching dataset of the same year are matched [@Lab1], so long as the reefs the two rows correspond to are within a distance of 1000 units. For the variables from four years ago, this process is repeated, with rows matched where the variables were measured four years before the bleaching data. ```{r}# #| cache: TRUE# eReefs_noNan = eReefs_df %>% drop_na()# # bleaching_sf = st_as_sf(bleachingData, coords = c("longitude", "latitude"), crs = 4326)# eReefs_sf = st_as_sf(eReefs_noNan, coords = c("long", "lat"), crs = 4326)# # eReefs_sf <- eReefs_sf %>%# mutate(year_eReef = year(time))# # bleaching_sf <- bleaching_sf %>%# mutate(year_bleach = year(surveyDate))# # joined_sf = st_join(bleaching_sf, eReefs_sf, join = st_is_within_distance, dist = 1000, left = TRUE)# # joined_df = joined_sf |># as.data.frame() |> drop_na() |># filter(year_eReef == year_bleach)# # joined_df_4yrago <- joined_sf %>% filter(year_bleach - year_eReef == 4) %>% as.data.frame()# saveRDS(joined_df, "joined_df")# saveRDS(joined_df_4yrago, "joined_df_4yrago")extract_coords <-function(data) { coords =st_coordinates(data$geometry) long_values = coords[, 1] lat_values = coords[, 2] data = data |>mutate(longitude = long_values,latitude = lat_values)}joined_df <-readRDS("joined_df")joined_df_4yrago <-readRDS("joined_df_4yrago")joined_df <-extract_coords(joined_df)joined_df_4yrago <-extract_coords(joined_df_4yrago)```## Executing solution### Principle Component AnalysisUsing explanatory variables (salt, ph, etc.) to predict a category (bleaching), can be achieved with machine learning (ML) models. 
These models are trained to identify patterns between the variables and the class, such that given a collection of variables for a particular row/instance, the model can predict which class the instance belongs to. Specifically, given values for salt, ph, algae, temperature and nitrogen, the model predicts which category of bleaching a reef of these measurements would experience.One way to decrease model complexity, and improve performance, is to perform principal component analysis (PCA). PCA finds multiple linear combinations of the variables, in order to find the combination(s) that explain highest amount of variation in the dataset (i.e: are the best at accurately predicting the class). This can help reduce the number of variables we use in our models, hence lowering computational cost. For example, if we have a dataset with 20 variables, and the first 3 principal components explain 90% of the total variance, we may be able to train our model using just those 3 components instead of all 20, vastly increasing efficiency.Performing PCA on the environmental variables, we find that the first three principal components account for 94% of the variance in the contemporary dataset, and 90% of the variance in the dataset from four years prior. We can then select these three principal components to train models with.```{r, results='asis'}set.seed(3888)perf_pca <- function(data, msg){ pca_model <- prcomp(data[, c("temp", "nitrogen", "salt", "algae", "ph")], scale. = TRUE) tab <- knitr::kable(as.data.frame(summary(pca_model)$importance ), caption = paste("PCA Component Importance", msg)) print(tab) pca_data <- as.data.frame(pca_model$x[, 1:3]) colnames(pca_data) <- c("PC1", "PC2", "PC3") pca_data$bleachCat <- data$bleachCat return(pca_data)}pca_res <- perf_pca(joined_df, "(contempoary)")pca_res_4y_ago <- perf_pca(joined_df_4yrago, "(4y ago)")```### Model deploymentWe deploy three ML models: k-nearest neighbor (kNN), a support vector machine (SVM), and a random forest (RF). 
kNN determines the class of a data point by looking at the majority class of its k nearest neighbours. Here, "nearness" is quantified using a distance measure, so that data points with similar explanatory variables have a low distance between them. SVM classifies data by finding the optimal decision boundary (a line, or more generally a hyperplane) that maximises the margin between classes. Its goal is to separate the classes as cleanly as possible, making each one easier to identify. Decision trees (DTs) build a hierarchical structure of decisions that classify an instance based on its variables. Random forests generate an ensemble of multiple DTs to improve prediction accuracy.

To obtain reliable accuracy estimates, models will be trained and evaluated using 5-fold cross validation (CV). 5-fold CV randomly splits the dataset into 5 subsets (folds), whereby the model is trained on 4 folds and evaluated on the 1 remaining fold. This process is repeated 5 times, using a different fold for evaluation each time.

```{r}
set.seed(3888)

cv <- function(data, n_sim = 50, cvK = 5, k_knn = 5) {
  cv_50acc5_knn = cv_50acc5_svm = cv_50acc5_rf = c()
  X = data[, c("PC1", "PC2", "PC3")]
  y = data$bleachCat
  for (i in 1:n_sim) {
    cvSets = cvFolds(nrow(X), cvK)
    cv_acc_knn = cv_acc_svm = cv_acc_rf = c()
    for (j in 1:cvK) {
      # Split into training and test folds
      test_id = cvSets$subsets[cvSets$which == j]
      X_test = X[test_id, ]
      X_train = X[-test_id, ]
      y_test = y[test_id]
      y_train = y[-test_id]
      # kNN
      knn_pred <- knn(train = X_train, test = X_test, cl = y_train, k = k_knn)
      cv_acc_knn[j] = mean(knn_pred == y_test)
      # SVM
      svm_res <- svm(x = X_train, y = as.factor(y_train))
      svm_pred <- predict(svm_res, X_test)
      cv_acc_svm[j] = mean(svm_pred == y_test)
      # Random forest
      rf_res <- randomForest(x = X_train, y = as.factor(y_train))
      rf_pred <- predict(rf_res, X_test)
      cv_acc_rf[j] = mean(rf_pred == y_test)
    }
    cv_50acc5_knn <- append(cv_50acc5_knn, mean(cv_acc_knn))
    cv_50acc5_svm <- append(cv_50acc5_svm, mean(cv_acc_svm))
    cv_50acc5_rf <- append(cv_50acc5_rf, mean(cv_acc_rf))
  }
  return(list(knn_accuracy = mean(cv_50acc5_knn),
              svm_accuracy = mean(cv_50acc5_svm),
              rf_accuracy = mean(cv_50acc5_rf)))
}
```

## Evaluation and improvements

### Comparing against hypothesis

Upon training and evaluating the models using CV, the following accuracy results were obtained. Model accuracy is defined as the percentage of correct class predictions, i.e. the proportion of predictions where a reef's predicted bleaching category matches its actual measured category.

Across both datasets, current and four years prior, RF was the most accurate model, followed by kNN and then SVM. Models built on the contemporary dataset had a slightly higher average prediction accuracy (45%) than those built on the dataset from four years prior (43%). These findings are consistent with the hypothesis.

```{r}
set.seed(3888)
cv_results <- cv(pca_res)
cv_results_4y_ago <- cv(pca_res_4y_ago)

cv_results_to_df <- function(data) {
  data.frame(Model = c("KNN", "SVM", "Random Forest"),
             Accuracy = c(data$knn_accuracy * 100,
                          data$svm_accuracy * 100,
                          data$rf_accuracy * 100))
}

cv_df <- cv_results_to_df(cv_results)
cv_df_4y_ago <- cv_results_to_df(cv_results_4y_ago)

plot1 <- ggplot(cv_df, aes(x = Model, y = Accuracy, fill = Model)) +
  geom_col() +
  geom_text(aes(label = round(Accuracy, 2)), vjust = -0.5) +
  labs(title = "Model Accuracy (contemporary)", x = "Model", y = "Accuracy (%)") +
  scale_y_continuous(limits = c(0, 60)) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")

plot2 <- ggplot(cv_df_4y_ago, aes(x = Model, y = Accuracy, fill = Model)) +
  geom_col() +
  geom_text(aes(label = round(Accuracy, 2)), vjust = -0.5) +
  labs(title = "Model Accuracy (4Y Ago)", x = "Model", y = "Accuracy (%)") +
  scale_y_continuous(limits = c(0, 60)) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")

plot1 + plot2 +
  plot_layout(guides = 'collect', ncol = 2, nrow = 1) +
  plot_annotation(tag_levels = 'A') &
  theme(legend.position = 'top')
```

### Generally low accuracy

Accuracy across all models in both datasets was low (around 44% on average).
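A single accuracy figure also hides where the errors occur. A confusion table, which cross-tabulates predicted against actual classes, shows which categories are being mistaken for which. A minimal sketch on hypothetical bleaching-category vectors (toy values, not the model output above):

```{r}
# Hypothetical predicted vs actual bleaching categories on the 0-5 scale
pred   <- c(0, 1, 1, 3, 5, 2, 0, 4)
actual <- c(0, 2, 1, 3, 4, 2, 1, 4)

mean(pred == actual)                      # plain accuracy: 5 of 8 correct = 0.625
table(Predicted = pred, Actual = actual)  # off-diagonal cells show which categories are confused
```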
The bleaching event investigated (2014–2017) was unprecedented and highly unpredictable, which may explain the low prediction accuracy. In addition, the spatial distance used to join the eReefs and bleaching datasets may have been too large, potentially matching environmental data that does not accurately reflect the conditions at a specific bleaching site. Reducing this spatial distance could therefore improve accuracy.

### Contemporary vs prior data

As discovered above, models built on the contemporary dataset had a higher prediction accuracy than those built on the dataset from four years prior. This may be because coral bleaches in real time, driven largely by the environmental conditions of the present, not those of four years ago.

### Accuracy variation

The three models vary in accuracy, which may be due to the nature of each model. RF was consistently the most accurate, perhaps because it is an ensemble of individual decision trees, each of which is a model in itself. This is not the case for kNN and SVM, which are single models.

### Accuracy as an incomplete metric

Furthermore, accuracy might not be the best metric to evaluate model performance. The class being predicted, the bleaching category (`bleachCat`), is an ordinal variable ranging from 0 to 5. Therefore, if a reef's bleaching category is 5, predicting 1 is worse than predicting 3, since 3 is closer to 5; an ordinal-aware metric such as mean absolute error would assign these predictions errors of 4 and 2 respectively. Accuracy alone, a binary comparison of whether a prediction equals the actual value, cannot capture such nuance. Hence, accuracy may give an incomplete view of the models' true performance, given the nature of the data. Models and/or evaluation methods built for ordinal data may be a more useful avenue to explore.

# References